@ulysses-you
Contributor

What changes were proposed in this pull request?

Add a check that the result's byte length does not overflow an int.

Why are the changes needed?

We encountered an extreme case with the concat_ws expression; the error message is:

Caused by: java.lang.NegativeArraySizeException
	at org.apache.spark.unsafe.types.UTF8String.concatWs

UTF8String.concat already got this length check in #21064, so it makes sense to add the same check to concatWs.
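
For readers following along, here is a minimal sketch of how unchecked int arithmetic can end in a NegativeArraySizeException. It assumes the size was computed entirely in int, as the original expression suggests; the class OverflowDemo and the helper uncheckedSize are hypothetical and not part of Spark.

```java
final class OverflowDemo {
    // Hypothetical helper: same shape as the unchecked expression, all in int.
    static int uncheckedSize(int numInputBytes, int numInputs, int separatorNumBytes) {
        return numInputBytes + (numInputs - 1) * separatorNumBytes;
    }

    public static void main(String[] args) {
        // One more byte than an int can hold: the sum silently wraps negative.
        int size = uncheckedSize(Integer.MAX_VALUE, 2, 1);
        System.out.println("computed size: " + size);
        try {
            byte[] result = new byte[size];
            System.out.println("allocated " + result.length + " bytes");
        } catch (NegativeArraySizeException e) {
            // The allocation, not the arithmetic, is where the failure surfaces.
            System.out.println("caught: " + e);
        }
    }
}
```

The arithmetic itself never fails; the error only appears at the array allocation, which is why the stack trace points at concatWs rather than at the addition.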

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

No test was added: a test would need inputs large enough to overflow an int byte count, which is too heavy.

@github-actions github-actions bot added the SQL label Apr 9, 2021
@SparkQA

SparkQA commented Apr 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41707/

@SparkQA

SparkQA commented Apr 9, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41707/

@SparkQA

SparkQA commented Apr 9, 2021

Test build #137129 has finished for PR 32106 at commit fceb65b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  // Allocate a new byte array, and copy the inputs one by one into it.
  // The size of the new array is the size of all inputs, plus the separators.
- final byte[] result = new byte[numInputBytes + (numInputs - 1) * separator.numBytes];
+ int intNumInputBytes = Ints.checkedCast(numInputBytes + (numInputs - 1) * separator.numBytes);
Member

Why did you use guava here instead of just checking numInputBytes > Int.MaxValue?

Contributor Author

This code just follows concat, which does the same check:

final byte[] result = new byte[Ints.checkedCast(totalLength)];

Member

Could you use Math.toIntExact here, and also replace the existing Ints.checkedCast with toIntExact, please? I don't see any benefit of the Guava function over the standard one, do you?

Contributor Author

Replaced it.
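
As a rough illustration of the replacement discussed in this thread, the sketch below narrows a long length with Math.toIntExact, which throws ArithmeticException when the value does not fit in an int (Guava's Ints.checkedCast signals the same condition with IllegalArgumentException). The class ToIntExactDemo and the helper toArraySize are hypothetical names used only for this example.

```java
final class ToIntExactDemo {
    // Hypothetical helper: narrow a long length to an int array size, or fail loudly.
    static int toArraySize(long totalLength) {
        return Math.toIntExact(totalLength);
    }

    public static void main(String[] args) {
        // A length that fits is returned unchanged.
        System.out.println("size: " + toArraySize(1024L));
        try {
            // One past Integer.MAX_VALUE cannot be represented as an int.
            toArraySize(Integer.MAX_VALUE + 1L);
        } catch (ArithmeticException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Using the JDK method keeps the check free of a third-party dependency, at the cost of a different exception type for callers that were catching the Guava one.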

@SparkQA

SparkQA commented Apr 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41766/

@SparkQA

SparkQA commented Apr 12, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41766/

@SparkQA

SparkQA commented Apr 12, 2021

Test build #137187 has finished for PR 32106 at commit 7411b6e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ulysses-you
Contributor Author

retest this please

@SparkQA

SparkQA commented Apr 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41775/

@SparkQA

SparkQA commented Apr 12, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41775/

@SparkQA

SparkQA commented Apr 12, 2021

Test build #137196 has finished for PR 32106 at commit 7411b6e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment on lines 1012 to 1013
int intNumInputBytes = Math.toIntExact(numInputBytes + (numInputs - 1) * separator.numBytes);
final byte[] result = new byte[intNumInputBytes];
Member

@MaxGekk MaxGekk Apr 12, 2021

If we want to prevent overflow, we should do that for all operators:

Suggested change
- int intNumInputBytes = Math.toIntExact(numInputBytes + (numInputs - 1) * separator.numBytes);
- final byte[] result = new byte[intNumInputBytes];
+ int resultSize = Math.toIntExact(Math.addExact(
+   numInputBytes,
+   Math.multiplyExact(numInputs - 1, separator.numBytes)));
+ final byte[] result = new byte[resultSize];

Contributor Author

numInputBytes has been changed to long. Do you mean (numInputs - 1) * separator.numBytes can overflow? Yes, it can. How about this?

    int intNumInputBytes = Math.toIntExact(
            numInputBytes + (numInputs - 1) * (long)separator.numBytes);

Member

"you mean (numInputs - 1) * separator.numBytes can overflow?"

Yep.

"How about this?"

I am OK with that.

Contributor Author

updated
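
To make the point of this thread concrete, here is a small sketch comparing three ways of computing the size. It assumes numInputBytes is already a long while numInputs and the separator length are ints; the class SeparatorOverflowDemo and its method names are hypothetical.

```java
final class SeparatorOverflowDemo {
    // The product is evaluated in int and can wrap before it is widened to long.
    static int sizeWithIntMultiply(long numInputBytes, int numInputs, int sepBytes) {
        return Math.toIntExact(numInputBytes + (numInputs - 1) * sepBytes);
    }

    // Casting one operand to long makes the whole product a long, so nothing wraps.
    static int sizeWithLongMultiply(long numInputBytes, int numInputs, int sepBytes) {
        return Math.toIntExact(numInputBytes + (numInputs - 1) * (long) sepBytes);
    }

    // The reviewer's alternative: every operator is checked individually.
    static int sizeAllExact(long numInputBytes, int numInputs, int sepBytes) {
        return Math.toIntExact(
            Math.addExact(numInputBytes, Math.multiplyExact(numInputs - 1, sepBytes)));
    }

    public static void main(String[] args) {
        // 65536 separators of 65536 bytes each: the int product wraps to 0.
        System.out.println("int multiply: " + sizeWithIntMultiply(10L, 65537, 65536));
        try {
            sizeWithLongMultiply(10L, 65537, 65536);
        } catch (ArithmeticException e) {
            System.out.println("long multiply rejected: " + e.getMessage());
        }
        try {
            sizeAllExact(10L, 65537, 65536);
        } catch (ArithmeticException e) {
            System.out.println("exact operators rejected: " + e.getMessage());
        }
    }
}
```

The first variant is the dangerous one: it passes the toIntExact check with a wrong but in-range value, so the array would be allocated too small rather than the call failing. Both the long cast and the exact operators close that gap; the cast is simply shorter.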

  // The size of the new array is the size of all inputs, plus the separators.
- final byte[] result = new byte[numInputBytes + (numInputs - 1) * separator.numBytes];
+ int intNumInputBytes = Math.toIntExact(
+ numInputBytes + (numInputs - 1) * (long)separator.numBytes);
Member

Could you re-check the indentation? I think it should be smaller.

Comment on lines 1012 to 1014
int intNumInputBytes = Math.toIntExact(
numInputBytes + (numInputs - 1) * (long)separator.numBytes);
final byte[] result = new byte[intNumInputBytes];
Member

BTW, you can put this in one line:

Suggested change
- int intNumInputBytes = Math.toIntExact(
- numInputBytes + (numInputs - 1) * (long)separator.numBytes);
- final byte[] result = new byte[intNumInputBytes];
+ int resultSize = Math.toIntExact(numInputBytes + (numInputs - 1) * (long)separator.numBytes);
+ final byte[] result = new byte[resultSize];

Contributor Author

done

@SparkQA

SparkQA commented Apr 12, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41782/

@SparkQA

SparkQA commented Apr 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41784/

@SparkQA

SparkQA commented Apr 12, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41784/

@SparkQA

SparkQA commented Apr 12, 2021

Test build #137203 has finished for PR 32106 at commit 9081799.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 12, 2021

Test build #137206 has finished for PR 32106 at commit 73f4824.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member

MaxGekk commented Apr 12, 2021

+1, LGTM. Merging to master.
Thank you @ulysses-you and @maropu for your review.

@MaxGekk MaxGekk closed this in 1be1012 Apr 12, 2021
@maropu
Member

maropu commented Apr 12, 2021

late lgtm

@MaxGekk
Member

MaxGekk commented Apr 12, 2021

@ulysses-you Would you like to review other places in UTF8String and make similar changes? For instance:

byte[] data = new byte[this.numBytes + pad.numBytes * count + remain.numBytes];
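
One possible way to apply the same pattern to the padding expression above is sketched here. It is not the follow-up patch itself; it assumes all four quantities are ints, and the class PadSizeDemo and the helper checkedPadSize are hypothetical.

```java
final class PadSizeDemo {
    // Hypothetical helper: compute the padded size in long, then narrow with a check.
    static int checkedPadSize(int numBytes, int padNumBytes, int count, int remainNumBytes) {
        // The cast applies to padNumBytes before the multiply, so the product is a long.
        long total = numBytes + (long) padNumBytes * count + remainNumBytes;
        return Math.toIntExact(total);
    }

    public static void main(String[] args) {
        // 3 existing bytes, four 2-byte pads, 1 remaining byte.
        System.out.println("size: " + checkedPadSize(3, 2, 4, 1));
        try {
            // 65536 pads of 65536 bytes each exceed the int range.
            checkedPadSize(0, 1 << 16, 1 << 16, 0);
        } catch (ArithmeticException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```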

@ulysses-you
Contributor Author

@MaxGekk created ticket SPARK-35041

@ulysses-you ulysses-you deleted the SPARK-35005 branch April 13, 2021 04:05
@ulysses-you
Contributor Author

created #32142
